Project Group 30¶

Members (Student numbers):

  1. Rody Boting (5421446)
  2. Hugo Odijk (4945824)
  3. Tim de Ridder (4937961)
  4. Wouter van der Veer (4708121)
  5. Jesse Zegeling (4717309)

Research Objective¶

One of the major news items in the Netherlands over the past year was the very long queues at Schiphol Airport due to staff shortages. Delays are among the most annoying aspects of flying. There are many reasons for delays, from airport capacity problems due to staff shortages to bad weather. And if it gets even worse, your flight might be cancelled.

Therefore, it is very interesting to look further into the air traffic delays. To analyse this, a database with all flights and delays in the United States in 2015 was found. This database will be investigated on the basis of the following research question:

What were the main air traffic delays for domestic flights in the United States over 2015?

To answer this research question, several subquestions will be answered as well:

  • What are the main causes of delay?
  • To which airline are the most delays linked?
  • Which airports encounter the most delay?
  • How do delays change over the year/week per airport?
  • Is there a correlation between weather and delay?

Schiphol Airport is one of the bigger airports in the world and it has shown some really big delays in recent months. Therefore, the hypothesis is that the biggest airports have the biggest delays as well. Because the flights from and to those airports are operated by the biggest airline carriers, the hypothesis is that these carriers experience the biggest delays too. Furthermore, it is expected that weather has a big influence on the delays; this hypothesis is investigated further as well.

Contribution Statement¶

  • Author 1 (Rody): Coding and text of the weather part (including visualisations); background research on the weather database
  • Author 2 (Hugo): Coding on the data modelling of several parts; coding of the main notebook (including creating the subnotebooks and main notebook); background research on three main databases
  • Author 3 (Tim): Coding of the maps of the part on airlines and airports and coding of the general part (including visualisations); background research on three main databases
  • Author 4 (Wouter): Coding and text of the parts on airlines and airports (including visualisations)
  • Author 5 (Jesse): Coding and text of the weather part (including visualisations); background research on the weather database

Data import and processing¶

The import of external data and the processing of these data is done in separate notebooks. In this paragraph, the DataFrames are imported from the other notebooks in order to show the visualisations in this report.
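
The hand-over between the processing notebooks and this report can be pictured as a pickle round trip. The snippet below is only a minimal sketch: the frame `df_processed` and the file name `example_frame` are hypothetical and do not come from the project notebooks.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a DataFrame produced by a processing notebook.
df_processed = pd.DataFrame(
    {"AIRLINE": ["AS", "NK"], "ARRIVAL_DELAY": [-3.0, 14.0]}
)

# In the processing notebook: write the frame to disk.
path = os.path.join(tempfile.mkdtemp(), "example_frame")
df_processed.to_pickle(path)

# In the main notebook: read it back, dtypes and index included.
df_loaded = pd.read_pickle(path)
```

Pickle keeps dtypes and the index intact, which a CSV export would not, but pickle files should only be loaded from trusted sources.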

Four external databases are used for this report:

  1. Database with all flights within the United States in 2015. The database contains basic information like the date, airline and the airports connected by the flight. Furthermore, the scheduled and actual departure/arrival times are known through the database as well as the delays and the reasons behind these delays.
  2. Database with all airlines that operated the flights in database 1.
  3. Database with all airports in the United States in 2015, including the corresponding IATA-code and coordinates.
  4. Database with the weather data for several airports in the United States in 2015.
In [1]:
# First of all, importing the packages
%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb

import plotly.express as px
import plotly.graph_objects as go
from geopy.geocoders import Nominatim
from plotly.subplots import make_subplots
import matplotlib as mpl
import calmap
import calplot
from urllib.request import urlopen
from plotly.offline import init_notebook_mode
import json
import itertools
import plotly.io as pio
import geopandas as gpd
init_notebook_mode(connected=True)
pio.renderers.default = "plotly_mimetype+notebook"
In [2]:
# Data import for "General conclusions on delay" from "Group30_NB_General"
Sample_Data = pd.read_pickle("./sample_data")

df_airports = pd.read_pickle("./airports")
df_flights_6hours = pd.read_pickle("./flights_6hours")
df_flights_24hours = pd.read_pickle("./flights_24hours")
In [3]:
# Data import for "Airline delays" from "Group30_NB_Airline1"
df_airline_names = pd.read_pickle("./Airline_Name")
df_Airline = pd.read_pickle("./df_Airline")
Delay_Airline = pd.read_pickle("./Delay_Airline")
In [4]:
# Data import for "Airport delays" from "Group30_NB_Airport1"
df_flights_ATL = pd.read_pickle("./flights_ATL")
df_flights_SEA = pd.read_pickle("./flights_SEA")
In [5]:
# Data import for "Airline delays" and "Airport delays" from "Group30_NB_Airline_Airport2"
Months_Airline = pd.read_pickle("./M_Airline.pkl")
Days_Airline = pd.read_pickle("./D_Airline.pkl")
Months_Airport = pd.read_pickle("./M_Airport.pkl")
Year_DP = pd.read_pickle("./Y_Airport.pkl")
In [6]:
# Data import for "Weather delays" from "Group30_NB_Weather"
df_location = pd.read_pickle("./weather_location")

df_KSEA = pd.read_pickle("./weather_KSEA")
df_KHOU = pd.read_pickle("./weather_KHOU")

df3_SEA = pd.read_pickle("./weather_df3SEA")
df2_HOU = pd.read_pickle("./weather_df2HOU")
df3_HOU = pd.read_pickle("./weather_df3HOU")
df4_HOU = pd.read_pickle("./weather_df4HOU")

Data visualisation and analysis¶

Introduction¶

First of all, some general information obtained from the flights database will be discussed, beginning with the flight distribution over the week. For this, a random sample of 200,000 flights out of the database is taken. The flights are spread evenly over the week; only Saturday lags somewhat behind the other days, with fewer flights.
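
How such a sample could be drawn is sketched below on synthetic data. The table `flights`, the sample size and the seed are assumptions for illustration only; the actual sampling is done in the processing notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flights table; DAY_OF_WEEK runs from 1 to 7.
rng = np.random.default_rng(0)
flights = pd.DataFrame({"DAY_OF_WEEK": rng.integers(1, 8, size=10_000)})

# Draw a reproducible random sample without replacement.
sample = flights.sample(n=2_000, random_state=42)

# Share of sampled flights per weekday, ordered by weekday number.
share_per_day = sample["DAY_OF_WEEK"].value_counts(normalize=True).sort_index()
```

Fixing `random_state` makes the sample, and therefore every figure based on it, reproducible between runs.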

In [7]:
### Plot flights distribution over the week
fig1 = plt.figure()
default_color = sb.color_palette()[0]
sb.countplot(data = Sample_Data, x = 'DAY_OF_WEEK', color = default_color)

plt.xticks(rotation=30)
plt.xlabel('Day of the week')
plt.ylabel('Count')
plt.title('Flights distribution over the week');

In addition, the flight duration can be investigated. The distribution of flight times over the same sample as used for the graph above shows that most flights have a duration of 50 to 100 minutes (roughly one to one and a half hours), with some outliers of more than 700 minutes (over 11 hours).

In [8]:
### Standard-scaled plot of flight duration distribution
fig2 = plt.figure()
binsize = 10
bins = np.arange(0, Sample_Data['AIR_TIME'].max()+binsize, binsize)

plt.hist(data = Sample_Data, x = 'AIR_TIME', bins = bins)

plt.xlabel('Air Time [min]')
plt.ylabel('Count')
plt.title('Standard-scaled plot of flight duration distribution');
plt.show()

Flight delays can be caused by many different factors. In the database used, these factors are grouped into four categories:

  • Weather Delay, for example delay due to strong winds in a storm.
  • Air System Delay, for example delays due to air traffic capacity problems (too much air traffic volume)
  • Security Delay, for example when an aircraft cannot leave the airport due to security reasons.
  • Late Aircraft Delay, for example there is a delay because the aircraft arrived late from the previous flight.

The heatmap is a correlation matrix which shows the correlation between these categories, the air time and the flight distance.

Here one can see that the expected correlation between distance and air time is indeed there. Furthermore, air system delay and late aircraft delay are negatively correlated, which suggests a possible relationship between the two. For the other delays there is no clear correlation in the sample data.
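
The correlation matrix behind such a heatmap can be sketched on synthetic data. The values below are made up, with air time constructed to grow with distance, and the column `AIRLINE` is only there to show how non-numeric columns can be left out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
distance = rng.uniform(100, 2500, size=500)

toy = pd.DataFrame({
    "DISTANCE": distance,
    # Air time grows roughly linearly with distance, plus some noise.
    "AIR_TIME": distance / 8 + rng.normal(0, 10, size=500),
    "WEATHER_DELAY": rng.exponential(5, size=500),
    "AIRLINE": ["XX"] * 500,  # non-numeric column, excluded below
})

# Pearson correlation between the numeric columns only.
corr = toy.select_dtypes("number").corr()
```

Selecting the numeric columns explicitly avoids relying on how pandas treats non-numeric columns in `DataFrame.corr`, which has changed between versions.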

In [9]:
### Heatmap with correlations between delays
fig3 = plt.figure()
variables = ['DEP_DELAY_POS', 'AIR_TIME', 'DISTANCE', 'WEATHER_DELAY',
             'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']

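# numeric_only=True keeps the current behaviour and silences the pandas FutureWarning
sb.heatmap(Sample_Data[variables].corr(numeric_only=True), annot = True, fmt = '.3f',
           cmap = 'mako_r', center = 0)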
plt.xticks(rotation=15)
plt.title('Heatmap of correlation between delay types')
plt.show()

The next plot shows the distribution of these four categories of delays plus departure and airline delay. The distribution is shown for the five busiest airports in the United States. Here one can see some similar results for all five airports for departure delay, air system delay and late aircraft delay. What sets apart the airports is the weather delay. This can be caused by the geographic location of the airports.

In [10]:
### Delays for the five busiest airports in USA per type of delay
Airport_5 = Sample_Data['ORIGIN_AIRPORT'].value_counts().head(5).keys()
df_Airport_5 = Sample_Data[Sample_Data['ORIGIN_AIRPORT'].isin(Airport_5)]
order = df_Airport_5.groupby(by=['ORIGIN_AIRPORT'])['DEP_DELAY_POS'].mean().sort_values(ascending=False).index

yvariables = ['DEP_DELAY_POS', 'AIRLINE_DELAY', 'WEATHER_DELAY', 
              'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']

plt.subplots(2,3,figsize=(12,7))
count = 1
default_color = sb.color_palette()[1]

for yvar in yvariables:
    plt.subplot(2,3, count)
    sb.boxplot(x= "ORIGIN_AIRPORT", y=yvar, data=df_Airport_5, color = default_color)
    plt.xlabel('ORIGIN_AIRPORT')
    plt.ylabel(f'{yvar} in min')
    plt.title(f'Distribution of {yvar}')
    plt.tight_layout()
    count +=1

What can be seen in the plots above is that, for these five airports, the airline delay is slightly more pronounced than the departure delay. When looking at the four categories of delay, it is noticeable that Air System Delay and Late Aircraft Delay are higher than the Weather Delay and Security Delay. The Security Delay in particular shows a very clear picture: the length of the delay is very low, with almost no variation in the distribution. Air System Delay and Late Aircraft Delay, on the other hand, show a lot more variation.

Departure delay¶

When looking at the departure delay of the flights in the sample, some noticeable things can be observed as well. In the standard-scaled plot, it can be seen that by far the most flights had a delay of 0-10 minutes, so most of the flights run fairly on time or with a foreseeable amount of delay. On the other hand, there were some flights with a departure delay of around 300 minutes (5 hours).

The log-scaled plot gives a better view of the distribution of the delays of less than 100 minutes. It also shows that for at least half of the flights the delay is not more than 30 minutes.
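
How the logarithmic bins work can be sketched with a handful of made-up delays. The array `delays` is hypothetical; the bin construction mirrors the one used for the log-scaled plot.

```python
import numpy as np

# Hypothetical positive departure delays in minutes (0 = on time).
delays = np.array([0, 0, 0, 2, 5, 8, 12, 25, 30, 45, 90, 300])

log_bin_size = 0.2
# Bin edges are evenly spaced in log10, starting at 10**0 = 1 minute.
edges = 10 ** np.arange(0, np.log10(delays.max()) + log_bin_size, log_bin_size)

# Delays below the first edge (here the zeros) fall outside every bin.
counts, _ = np.histogram(delays, bins=edges)
outside = int((delays < edges[0]).sum())

# Share of flights with at most 30 minutes of delay, zeros included.
share_within_30 = float((delays <= 30).mean())
```

Because the first edge lies at 1 minute, flights with zero delay are not drawn in a log-binned histogram, so a share such as `share_within_30` is best computed from the data directly rather than read from the plot.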

In [11]:
### Standard-scaled plot of departure delay distribution
fig5 = plt.figure()
bin_size = 10
bins = np.arange(0, Sample_Data['DEP_DELAY_POS'].max()+bin_size, bin_size)

plt.hist(data = Sample_Data, x = 'DEP_DELAY_POS', bins = bins)

plt.xlabel('Departure Delay [min]')
plt.ylabel('Count')
plt.title('Standard-scaled plot of departure delay distribution')
plt.xlim([0, 500])
plt.show()
In [12]:
### Log-scaled plot of departure delay distribution
fig6 = plt.figure()
log_bin_size = 0.2
bins = 10 ** np.arange(0, np.log10(Sample_Data['DEP_DELAY_POS'].max())+log_bin_size, log_bin_size)

plt.hist(data = Sample_Data, x = 'DEP_DELAY_POS', bins = bins)

plt.xscale('log')
plt.xticks([1, 4, 10, 40, 100, 400, 1000], ['1', '4','10', '40', '100', '400', '1000'])
plt.xlabel('Departure Delay [min]')
plt.ylabel('Count')
plt.title('Log-scaled plot of departure delay distribution')
plt.show()

Flights with arrival delay¶

Next to the departure delay, flights can have an arrival delay as well. The following two maps show all flights of 2015 that had an arrival delay of more than 6 hours and more than 24 hours, respectively. Especially the first figure shows that there are a lot of flights with significant delays in 2015 and that there doesn't seem to be a certain airport or route that has more delay than others. The map with 24-hour delay flights shows a clearer picture, where one airport stands out: Dallas Fort Worth International Airport, which has a lot of lines connected. This airport is therefore susceptible to large delays.
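
The selection of these flights, and a lighter way to prepare the lines for a map, can be sketched as follows. The column names `ARRIVAL_DELAY`, `start_lon`, `start_lat`, `end_lon` and `end_lat` and all values are assumptions for illustration; the helper `path_arrays` is hypothetical.

```python
import pandas as pd

# Hypothetical flights: arrival delay in minutes plus endpoint coordinates.
flights = pd.DataFrame({
    "ARRIVAL_DELAY": [15, 400, 1500, 361],
    "start_lon": [-84.4, -97.0, -122.3, -87.9],
    "start_lat": [33.6, 32.9, 47.4, 41.9],
    "end_lon": [-71.0, -73.9, -97.0, -97.0],
    "end_lat": [42.4, 40.8, 32.9, 32.9],
})

# Thresholds expressed in minutes.
over_6h = flights[flights["ARRIVAL_DELAY"] > 6 * 60]
over_24h = flights[flights["ARRIVAL_DELAY"] > 24 * 60]


def path_arrays(df):
    """Return lon/lat lists with None between consecutive flight segments."""
    lon, lat = [], []
    for row in df.itertuples(index=False):
        lon += [row.start_lon, row.end_lon, None]
        lat += [row.start_lat, row.end_lat, None]
    return lon, lat


lon, lat = path_arrays(over_6h)
```

Since plotly leaves a gap in a line wherever a coordinate is `None`, these two lists could be passed to a single `go.Scattergeo(lon=lon, lat=lat, mode='lines')` trace instead of adding one trace per flight, which tends to render faster for large selections.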

In [13]:
### Plot flights with delay bigger than 6 hours
fig = go.Figure()

fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = df_airports['LONGITUDE'],
    lat = df_airports['LATITUDE'],
    hoverinfo = 'text',
    text = df_airports['AIRPORT'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 100, 100)',
        line = dict(
            width = 3,
            color = 'rgba(100, 100, 100, 0)'
        )
    )))

flight_paths = []
for i in range(len(df_flights_6hours)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'USA-states',
            lon = [df_flights_6hours.iloc[i,5], df_flights_6hours.iloc[i,3]],
            lat = [df_flights_6hours.iloc[i,4], df_flights_6hours.iloc[i,2]],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = 1,
        )
    )

fig.update_layout(
    title_text = f'Flights in 2015 with more than 6 hour delay',
    showlegend = False,
    geo = dict(
        scope = 'north america',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(200, 200, 200)',
        countrycolor = 'rgb(180, 180, 180)',
    ),
)

fig.show()
In [14]:
### Plot flights with delay bigger than 24 hours
fig = go.Figure()

fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = df_airports['LONGITUDE'],
    lat = df_airports['LATITUDE'],
    hoverinfo = 'text',
    text = df_airports['AIRPORT'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 100, 100)',
        line = dict(
            width = 3,
            color = 'rgba(100, 100, 100, 0)'
        )
    )))

flight_paths = []
for i in range(len(df_flights_24hours)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'USA-states',
            lon = [df_flights_24hours.iloc[i,5], df_flights_24hours.iloc[i,3]],
            lat = [df_flights_24hours.iloc[i,4], df_flights_24hours.iloc[i,2]],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = 1,
        )
    )

fig.update_layout(
    title_text = f'Flights in 2015 with more than 24 hour delay',
    showlegend = False,
    geo = dict(
        scope = 'north america',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(200, 200, 200)',
        countrycolor = 'rgb(180, 180, 180)',
    ),
)

fig.show()

Cancellation rates¶

Flights can not only have departure and/or arrival delays, they also can be cancelled completely. The first graph shows the cancellation rates for the 30 busiest airports in the United States. LaGuardia Airport leads the charts with around 4 percent of all flights being cancelled.

To see how the cancellations are spread over the week, the flight cancellation rates per day of the week are displayed below in descending order. Monday has about twice as many cancellations as Wednesday, Friday and Saturday.

In [15]:
### Cancellation rates for the 30 busiest airports in USA
Airport_30 = Sample_Data['ORIGIN_AIRPORT'].value_counts().head(30).keys()
df_Airport_30 = Sample_Data[Sample_Data['ORIGIN_AIRPORT'].isin(Airport_30)]
order = df_Airport_30.groupby(by=['ORIGIN_AIRPORT'])['CANCELLED'].mean().sort_values(ascending=False)

fig7 = plt.figure()
origin_airport = order.index
mean_cancellation_rate = order
plt.bar(origin_airport, mean_cancellation_rate)

plt.xlabel('Origin Airport')
plt.ylabel('Cancellation Rate')
plt.title('Cancellation rates for the 30 busiest airports in USA')
plt.xticks(rotation=60)
plt.show()
In [16]:
### Cancellation rates for days of the week
order = Sample_Data.groupby(by=['DAY_OF_WEEK'])['CANCELLED'].mean().sort_values(ascending=False)

fig8 = plt.figure()
day_of_week = order.index
mean_cancellation_rate = order
plt.bar(day_of_week, mean_cancellation_rate)

plt.xlabel('Day of week')
plt.ylabel('Cancellation Rate')
plt.title('Cancellation rates for days of the week')
plt.xticks(rotation=30)
plt.show()

In the bar graph below the cancellations are shown for the 10 busiest airports, together with the reasons why the flights were cancelled. What can be concluded directly from the figure is that a significant share of the flights is cancelled due to weather, especially for O'Hare International Airport (ORD) and Dallas Fort Worth International Airport (DFW).
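
One way to express cancellations by reason as a rate per airport is sketched below. This is an alternative normalisation on made-up data, dividing by each airport's own number of flights; it is not necessarily the calculation used for the figure, and the reason labels are assumed.

```python
import pandas as pd

# Hypothetical flights; a missing reason means the flight was not cancelled.
toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ORD"] * 4 + ["DFW"] * 5,
    "CANCELLATION_REASON": ["weather", None, None, "carrier",
                            "weather", "weather", None, None, None],
})

# Count cancelled flights per airport and reason.
cancelled = toy.dropna(subset=["CANCELLATION_REASON"])
counts = pd.crosstab(cancelled["ORIGIN_AIRPORT"], cancelled["CANCELLATION_REASON"])

# Divide each row by the total number of flights from that airport.
flights_per_airport = toy["ORIGIN_AIRPORT"].value_counts()
rates = counts.div(flights_per_airport, axis=0)
```

Normalising per airport makes the bars comparable between a very busy airport and a smaller one, because each bar then shows the share of that airport's own flights.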

In [17]:
### Type of cancellations for 10 busiest airports in the USA
Airport_10 = Sample_Data['ORIGIN_AIRPORT'].value_counts().head(10).keys()
df_Airport_10 = Sample_Data[Sample_Data['ORIGIN_AIRPORT'].isin(Airport_10)]

def cancellation_reason_rate(airport, cancellation_reason):
    code_reason = df_Airport_10[(df_Airport_10['ORIGIN_AIRPORT'] == airport) & (df_Airport_10['CANCELLATION_REASON'] == cancellation_reason)].shape[0]
    return (code_reason/df_Airport_10.shape[0])

### Create different bars of bar plot
bar1 = [cancellation_reason_rate(airport, 'carrier') for airport in Airport_10]
bar2 = [cancellation_reason_rate(airport, 'weather') for airport in Airport_10]
bar3 = [cancellation_reason_rate(airport, 'NAS') for airport in Airport_10]
bar4 = [cancellation_reason_rate(airport, 'security') for airport in Airport_10]

### Naming of attributes
cancellation_reasons = ['Airline', 'Weather', 'Air system', 'Security']
bar12 = np.add(bar1, bar2).tolist()
bar123 = np.add(bar12, bar3).tolist()
r = [0,1,2,3,4,5,6,7,8,9]
 
names = list(Airport_10)
barWidth = 1

### Plotting of bar chart
fig=plt.figure()
plt.bar(r, bar1, color='#7DF9FF', edgecolor='white', width=barWidth)
plt.bar(r, bar2, bottom=bar1, color='#0096FF', edgecolor='white', width=barWidth)
plt.bar(r, bar3, bottom=bar12, color='#0000FF', edgecolor='white', width=barWidth)
plt.bar(r, bar4, bottom=bar123, color='#1434A4', edgecolor='white', width=barWidth)
 
plt.xticks(r, names, fontweight='bold')
plt.xlabel("Origin Airport")
plt.ylabel('Cancellation Rate')
plt.legend(cancellation_reasons) 
plt.title('Type of cancellations for 10 busiest airports in the USA')
plt.show()

Furthermore, the following three subjects are covered in more detail:

  • Airline delays; relationship between the airline and the delay
  • Airport delays; relationship between the airport and the delay
  • Weather delays; relationship between the weather and the delay

Airline delays¶

First, the delays in relation to the various airlines that operated flights in the United States in 2015 are investigated. The pie chart below shows the total delays of airlines within the USA in 2015. This figure doesn't consider how many flights an airline had. Therefore in the next figures, the delay (or specifically the Arrival delay) will be averaged out by flights per month and day to create a better insight into the performance of the airlines.

In [18]:
### For the labels of the pie chart a list is needed, so the DataFrame is converted to a list
Airline_Name = df_airline_names.iloc[:, 0].tolist()

### Pie chart of airline delays in the USA in 2015
labels = Airline_Name
sizes = df_Airline['Average_Delay']
explode = (0,0,0,0,0,0,0,0,0,0,0,0,0,0.2)

fig1, ax1 = plt.subplots(figsize = (10,10))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=55)
ax1.axis('equal')

plt.title("Pie chart of total delays of airlines in the USA in 2015")
plt.show()
In [19]:
fig = px.bar(Months_Airline,y="Airline", x="Average arrival delay",animation_frame="Date",animation_group="Airline",color="Airline", range_x=[-20,100])
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(showlegend=False,
    title="Average arrival delay of airlines per month",
    xaxis_title="Average arrival delay per month [min/flight]",
    yaxis_title="Airlines",)
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
fig.show()

For airlines, the average arrival delay has been chosen as the indicator of performance. This type of delay is the difference between the arrival time on the passenger's ticket and the time at which the plane actually arrives.
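
The indicator can be sketched on a few made-up flights. The column names `SCHEDULED_ARRIVAL` and `ACTUAL_ARRIVAL` and all timestamps are assumptions; the real averaging is done in the processing notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "AIRLINE": ["NK", "NK", "AS", "AS", "AS"],
    "MONTH": [1, 1, 1, 1, 2],
    "SCHEDULED_ARRIVAL": pd.to_datetime([
        "2015-01-05 10:00", "2015-01-06 12:00", "2015-01-05 09:00",
        "2015-01-07 11:00", "2015-02-01 08:00",
    ]),
    "ACTUAL_ARRIVAL": pd.to_datetime([
        "2015-01-05 10:20", "2015-01-06 12:10", "2015-01-05 08:55",
        "2015-01-07 11:01", "2015-02-01 07:50",
    ]),
})

# Arrival delay in minutes; negative values mean an early arrival.
toy["ARRIVAL_DELAY"] = (
    toy["ACTUAL_ARRIVAL"] - toy["SCHEDULED_ARRIVAL"]
).dt.total_seconds() / 60

# Average arrival delay per airline and month, in minutes per flight.
monthly = toy.groupby(["AIRLINE", "MONTH"])["ARRIVAL_DELAY"].mean()
```

Averaging per flight rather than summing keeps airlines with many flights comparable to airlines with few.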

As can be seen from the graph, Spirit Air Lines (NK) has the most delay on average over the year. Spirit Air Lines never leaves the top four, and leads the monthly ranking of most average delay five times. Its average yearly delay is 14.19 minutes per flight, which is significantly more than almost all the other airlines. The only airline that comes close to Spirit Air Lines is Frontier Airlines (F9), with a yearly mean of 12.89 minutes of delay per flight.

Noteworthy is Alaska Airlines (AS). This is the only airline with a negative average delay over the year: on average, its flights arrive ahead of schedule.

Note that US Airways (US) does not have any delays from Month 07 (July) onwards. This is because US Airways no longer operated flights after June. The same can be seen in the figure below from the last day of June (2015-06-30) onwards.

In [20]:
fig = px.bar(Days_Airline,y="Airline", x="Average arrival delay",animation_frame="Date",animation_group="Airline",color="Airline", range_x=[-20,100])
fig.update_yaxes(categoryorder="total ascending") #set order ascending
fig.update_layout(showlegend=False,
                     title="Average arrival delay of airlines per day",
    xaxis_title="Average arrival delay [min/flight]",
    yaxis_title="Airlines",) #styling
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000 #set animation frame time
fig.show()

Airport delays¶

First of all, two airports will be covered: Hartsfield–Jackson Atlanta International Airport (ATL) and Seattle–Tacoma International Airport (SEA). The former is chosen as it is the biggest airport in the United States and because of its central location in the country. The latter is chosen as it is also used as an example in the weather delay analysis.

First, all flights for Atlanta Airport are shown. The more opaque the line, the more flights there are on that connection. You can see for example that there are fewer flights to Hawaii (Honolulu International Airport) than to Boston (Logan International Airport) or LaGuardia Airport in New York. The second visualisation shows the average delay of the flights from Atlanta Airport in 2015. Here the opacity of the line shows the average delay per flight for that connection. What is remarkable is that flights to the north have a higher average delay than flights to the south.
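
The scaling of a route value to a line opacity can be sketched as follows. The table `routes`, the helper `to_opacity` and its `floor` parameter are hypothetical; the maps themselves divide each value by the column maximum.

```python
import pandas as pd

# Hypothetical routes with a flight count and an average delay in minutes.
routes = pd.DataFrame({
    "destination": ["BOS", "LGA", "HNL"],
    "cnt": [4000, 5000, 500],
    "avg_delay": [12.0, 15.0, -3.0],
})


def to_opacity(values, floor=0.05):
    """Scale values by their maximum and clip the result to [floor, 1]."""
    scaled = values / values.max()
    return scaled.clip(lower=floor, upper=1.0)


routes["opacity_cnt"] = to_opacity(routes["cnt"])
routes["opacity_delay"] = to_opacity(routes["avg_delay"])
```

Clipping matters for the delay map: a route with a negative average delay would otherwise yield a negative opacity, which is outside the range from 0 to 1 that plotly accepts.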

In [21]:
### Plot the number of flights from Atlanta Airport
fig = go.Figure()

fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = df_airports['LONGITUDE'],
    lat = df_airports['LATITUDE'],
    hoverinfo = 'text',
    text = df_airports['AIRPORT'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 100, 100)',
        line = dict(
            width = 3,
            color = 'rgba(100, 100, 100, 0)'
        )
    )))

flight_paths = []
for i in range(len(df_flights_ATL)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'USA-states',
            lon = [df_flights_ATL['start_lon'][i], df_flights_ATL['end_lon'][i]],
            lat = [df_flights_ATL['start_lat'][i], df_flights_ATL['end_lat'][i]],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = float(df_flights_ATL['cnt'][i]) / float(df_flights_ATL['cnt'].max()),
        )
    )

fig.update_layout(
    title_text = f'All flights from Atlanta in 2015',
    showlegend = False,
    geo = dict(
        scope = 'north america',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(200, 200, 200)',
        countrycolor = 'rgb(180, 180, 180)',
    ),
)

fig.show()
In [22]:
### Plot the average delays from Atlanta Airport
fig = go.Figure()

fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = df_airports['LONGITUDE'],
    lat = df_airports['LATITUDE'],
    hoverinfo = 'text',
    text = df_airports['AIRPORT'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 100, 100)',
        line = dict(
            width = 3,
            color = 'rgba(100, 100, 100, 0)'
        )
    )))

flight_paths = []
for i in range(len(df_flights_ATL)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'USA-states',
            lon = [df_flights_ATL['start_lon'][i], df_flights_ATL['end_lon'][i]],
            lat = [df_flights_ATL['start_lat'][i], df_flights_ATL['end_lat'][i]],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = float(df_flights_ATL['avg_delay'][i]) / float(df_flights_ATL['avg_delay'].max()),
        )
    )

fig.update_layout(
    title_text = f'Average delay of flights from Atlanta in 2015',
    showlegend = False,
    geo = dict(
        scope = 'north america',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(200, 200, 200)',
        countrycolor = 'rgb(180, 180, 180)',
    ),
)

fig.show()

The same thing is done for Seattle–Tacoma International Airport. The flights map shows that there are a lot of flights going to the bigger airports in California, like San Francisco International Airport and Los Angeles International Airport, and that there are fewer flights going to the East Coast, especially Florida. The second map shows that the average delay for all these flights is quite similar, regardless of the destination.

In [23]:
### Plot the number of flights from Seattle Airport
fig = go.Figure()

fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = df_airports['LONGITUDE'],
    lat = df_airports['LATITUDE'],
    hoverinfo = 'text',
    text = df_airports['AIRPORT'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 100, 100)',
        line = dict(
            width = 3,
            color = 'rgba(100, 100, 100, 0)'
        )
    )))

flight_paths = []
for i in range(len(df_flights_SEA)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'USA-states',
            lon = [df_flights_SEA['start_lon'][i], df_flights_SEA['end_lon'][i]],
            lat = [df_flights_SEA['start_lat'][i], df_flights_SEA['end_lat'][i]],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = float(df_flights_SEA['cnt'][i]) / float(df_flights_SEA['cnt'].max()),
        )
    )

fig.update_layout(
    title_text = f'All flights from Seattle in 2015',
    showlegend = False,
    geo = dict(
        scope = 'north america',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(200, 200, 200)',
        countrycolor = 'rgb(180, 180, 180)',
    ),
)

fig.show()
In [24]:
### Plot the average delays from Seattle Airport
fig = go.Figure()

fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = df_airports['LONGITUDE'],
    lat = df_airports['LATITUDE'],
    hoverinfo = 'text',
    text = df_airports['AIRPORT'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 100, 100)',
        line = dict(
            width = 3,
            color = 'rgba(100, 100, 100, 0)'
        )
    )))

flight_paths = []
for i in range(len(df_flights_SEA)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'USA-states',
            lon = [df_flights_SEA['start_lon'][i], df_flights_SEA['end_lon'][i]],
            lat = [df_flights_SEA['start_lat'][i], df_flights_SEA['end_lat'][i]],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = float(df_flights_SEA['avg_delay'][i]) / float(df_flights_SEA['avg_delay'].max()),
        )
    )

fig.update_layout(
    title_text = f'Average delay of flights from Seattle in 2015',
    showlegend = False,
    geo = dict(
        scope = 'north america',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(200, 200, 200)',
        countrycolor = 'rgb(180, 180, 180)',
    ),
)

fig.show()

Whereas the maps above visualise two of the bigger airports in the United States, the analysis below covers all airports in the United States. The average delay of the airports per month and per year is visualised.

In [25]:
fig = px.bar(Months_Airport,y="Airport", x="Average departure delay",animation_frame="Date",animation_group="Airport",color="Airport",range_y=[607.5,627.5],range_x=[0,140])
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(showlegend=False,
                     title="Average departure delay of airports per month [Top 20]",
    xaxis_title="Average departure delay per airport [min/flight]",
    yaxis_title="Airport",)
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
fig.show()

In the figure above the departure delay per airport is displayed. The departure delay is used because the departure from an airport is mainly influenced by the airport itself. Only the top twenty airports per month are shown, because the list of airports in the DataFrame is long. With this many airports and a top 20 that changes heavily from month to month, it is hard to see which airports perform badly regarding departure delay. Therefore, to draw a conclusion on this subject, the yearly average is used, which can be seen in the figure below. The yearly average gives the following top 5 in most departure delay:

  1. Trenton-Mercer Airport (TTN), New Jersey - 46.8 minutes per flight
  2. Wilmington International Airport (ILM), North Carolina - 36.9 minutes per flight
  3. Northeast Florida Regional Airport (UST), Florida - 33.0 minutes per flight
  4. Santa Maria Public Airport (SMX), California - 31.8 minutes per flight
  5. Unknown Airport 11471 - 31.3 minutes per flight

None of the top 20 airports in the graph below is among the 100 biggest airports of the United States. So it is concluded that the biggest airports do not have the biggest delays on average.
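
A ranking of this kind can be sketched with a grouped mean that also keeps the number of flights per airport. The data and the column name `DEPARTURE_DELAY` are made up for illustration.

```python
import pandas as pd

# Hypothetical flights: a small airport with few flights, a large one with many.
toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["TTN"] * 2 + ["ATL"] * 6,
    "DEPARTURE_DELAY": [60, 30, 10, 0, 5, 20, 15, 10],
})

# Mean delay and number of flights per airport, worst airport first.
summary = (
    toy.groupby("ORIGIN_AIRPORT")["DEPARTURE_DELAY"]
    .agg(mean_delay="mean", flights="count")
    .sort_values("mean_delay", ascending=False)
)
```

Keeping the flight count next to the mean helps when reading the ranking, since an average over a few flights is far more sensitive to single long delays than an average over many.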

In [26]:
fig = px.bar(Year_DP,y="Airport", x="Average departure delay",animation_frame="Date",animation_group="Airport",color="Airport",range_y=[607.5,627.5],range_x=[0,140])
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(showlegend=False,
                     title="Average departure delay of airports [Top 20]",
    xaxis_title="Average departure delay per airport [min/flight]",
    yaxis_title="Airport",)

fig.show()

These findings can be related to the maps shown earlier of all flights with more than 6 hours or more than 24 hours of delay.

To identify whether these poor results are caused by a small group of flights or reflect a continuous problem, the flights with more than 6 and more than 24 hours of delay, as plotted in those maps, are considered. Trenton-Mercer Airport (TTN), the worst performing airport based on departure delay, does not show up among the flights with more than 24 hours of delay, and it has only around 10 flights with more than 6 hours of delay. This indicates that Trenton has an overall performance issue. The same can be said for the second worst performing airport, Wilmington International Airport (ILM), which has even fewer flights with more than 6 hours of delay.
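
The distinction between a few extreme flights and a persistent problem can be sketched by comparing mean, median and the number of very long delays. The airports `AAA` and `BBB` and all values are hypothetical.

```python
import pandas as pd

# AAA is delayed on every flight; BBB has one extreme outlier.
toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["AAA"] * 5 + ["BBB"] * 5,
    "DEPARTURE_DELAY": [40, 45, 50, 55, 60, 0, 0, 5, 5, 400],
})

stats = toy.groupby("ORIGIN_AIRPORT")["DEPARTURE_DELAY"].agg(
    mean="mean", median="median"
)

# Number of flights per airport with more than 6 hours (360 minutes) of delay.
stats["over_6h"] = (
    toy[toy["DEPARTURE_DELAY"] > 360]
    .groupby("ORIGIN_AIRPORT")
    .size()
    .reindex(stats.index, fill_value=0)
)
```

A high mean combined with a similar median and few extreme flights, as for `AAA`, points to a structural issue, whereas a high mean with a low median, as for `BBB`, points to outliers.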

Weather delays¶

Now that sufficient information has been found regarding the delay at the various airports, the relationship between this delay and the weather will be examined. Whether the amount of precipitation and the temperature affect delay was investigated in several steps, in order to answer the last sub-question and to confirm or reject the hypothesis that weather conditions affect delay.

In [27]:
### Plot temperature and precipitation of airports
fig = px.scatter_geo(df_location, lat="lat", lon="lon", color="actual_precipitation", size="actual_mean_temp", color_continuous_midpoint=6, range_color=[0,200],
                  color_continuous_scale='pubu', projection="albers usa", scope="usa", labels={"actual_precipitation": "Actual precipitation", "date": "Date"},
                     animation_frame="date")

fig.update_layout(geo=dict(landcolor='rgb(0,0,0)'), title= 'Temperature (size) and precipitation (colour) of 7 airports in the USA ')
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 50
fig.show()

The first visualisation, shown above, is a map of the United States. The figure is generated using weather data covering the first 6 months of 2015. The data file contains, among other things, the amount of precipitation and the temperature for 10 different airports in America for each day. The weather stations are visualised with circles. For each day within the first half of 2015 these circles change colour and size. The colour gives the actual precipitation and the size increases with the temperature. A white circle means no precipitation, and the bluer the circle, the more rain has fallen that day. This gives a first impression of the amount of rainfall and the differences in temperature.

Secondly, the two graphs below present the precipitation and temperature per day at Seattle-Tacoma International Airport (weather station KSEA) and at William P. Hobby Airport in Houston (weather station KHOU). In these graphs all days with a precipitation above 20 inches are highlighted. For Seattle as well as for Houston, the temperature increases as the year progresses. In addition, both locations clearly have days with a lot of rain, which could therefore have an effect on air traffic.

In [28]:
### Plot temperature and precipitation of Seattle airport
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Bar(x=df_KSEA.date, y=df_KSEA.actual_precipitation, name='Actual precipitation'),
    secondary_y=False)

fig.add_trace(go.Scatter(
    x=df_KSEA[df_KSEA.actual_precipitation > 20].date, y=df_KSEA[df_KSEA.actual_precipitation > 20].actual_precipitation,
                    mode='markers', name='Precipitation above 20 inches'))

fig.add_trace(go.Scatter(x=df_KSEA.date, y=df_KSEA.actual_mean_temp, name="Actual mean temp"),
    secondary_y=True)

# Add figure title
fig.update_layout(
    title_text="Temperature and precipitation Seattle Airport"
)

# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="Precipitation [Inches]", secondary_y=False)
fig.update_yaxes(title_text="Temperature [F]", secondary_y=True)

fig.show()
In [29]:
### Plot temperature and precipitation of Houston airport
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Bar(x=df_KHOU.date, y=df_KHOU.actual_precipitation, name='Actual precipitation'),
    secondary_y=False)

fig.add_trace(go.Scatter(
    x=df_KHOU[df_KHOU.actual_precipitation > 20].date, y=df_KHOU[df_KHOU.actual_precipitation > 20].actual_precipitation,
                    mode='markers', name='Precipitation above 20 inches'))

fig.add_trace(go.Scatter(x=df_KHOU.date, y=df_KHOU.actual_mean_temp, name="Actual mean temp"),
    secondary_y=True)

# Add figure title
fig.update_layout(
    title_text="Temperature and precipitation Houston Airport"
)

# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="Precipitation [Inches]", secondary_y=False)
fig.update_yaxes(title_text="Temperature [F]", secondary_y=True)

fig.show()

The next step is to look at the relationship between delay and precipitation. For both airports, the average delay per day is compared to the rainfall of that day. Neither graph below shows a clear pattern, which is also reflected in the correlation coefficient. A correlation coefficient is a number between -1 and 1 that tells how strongly the two variables vary together across the dataset. The correlation coefficient is 0.39 for Seattle and 0.43 for Houston. Both values indicate a moderate positive relationship: there is some relation between the two variables, but bad weather does not directly result in extra delay.
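How a table with one row per day, holding the average delay and the rainfall of that day, could be put together is sketched below. This is an assumption about the preparation step and not the notebook's own code: the input column names `date`, `DEPARTURE_DELAY` and `actual_precipitation` and all values are hypothetical, while the output columns `Average_Delay` and `Rainfall` follow the names used in the cells below.

```python
import pandas as pd

# Hypothetical flights departing from one airport (column names are assumptions).
flights = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02", "2015-01-03"]),
    "DEPARTURE_DELAY": [10.0, 20.0, 0.0, 4.0, 30.0],  # minutes
})

# Hypothetical daily weather observations for the same airport.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "actual_precipitation": [0.5, 0.0, 1.2],
})

# Average delay per day.
daily_delay = (
    flights.groupby("date", as_index=False)["DEPARTURE_DELAY"].mean()
    .rename(columns={"DEPARTURE_DELAY": "Average_Delay"})
)

# Join the rainfall of the same day; days missing in either table are dropped.
daily = daily_delay.merge(
    weather.rename(columns={"actual_precipitation": "Rainfall"}),
    on="date", how="inner")
print(daily)
```

An inner join is chosen here so that only days with both flight and weather data enter the comparison.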

In [30]:
### Plot average delay per day over rainfall for Seattle
fig = px.scatter(x=df3_SEA.Rainfall, y=df3_SEA.Average_Delay)
fig.update_layout(
    title_text="Average delay versus rainfall at Seattle Airport",
    xaxis=dict(title="Rainfall [Inches]"),
    yaxis=dict(title="Average Delay [min]"))
fig.show()

### Compute the Pearson correlation coefficient (r itself, not R-squared)
r_SEA = np.corrcoef(df3_SEA.Rainfall, df3_SEA.Average_Delay)[1,0]
print(f'The correlation coefficient between rainfall and the average delay is equal to: {r_SEA:.2f}')
The correlation coefficient between rainfall and the average delay is equal to: 0.39
In [31]:
### Plot average delay per day over rainfall for Houston
fig = px.scatter(x=df3_HOU.Rainfall, y=df3_HOU.Average_Delay)
fig.update_layout(
    title_text="Average delay versus rainfall at Houston Airport",
    xaxis=dict(title="Rainfall [Inches]"),
    yaxis=dict(title="Average Delay [min]"))
fig.show()

### Compute the Pearson correlation coefficient (r itself, not R-squared)
r_HOU = np.corrcoef(df3_HOU.Rainfall, df3_HOU.Average_Delay)[1,0]
print(f'The correlation coefficient between rainfall and the average delay is equal to: {r_HOU:.2f}')
The correlation coefficient between rainfall and the average delay is equal to: 0.43
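The quantity computed in the two cells above can be illustrated on synthetic data. `np.corrcoef` returns the Pearson correlation coefficient r. Squaring r gives the R-squared value of a simple linear fit, which is a different and always smaller-or-equal number in absolute terms. The arrays below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Synthetic daily values, for illustration only.
rainfall = np.array([0.0, 0.1, 0.0, 0.8, 1.5, 0.2, 0.0, 2.0])
average_delay = np.array([8.0, 12.0, 5.0, 10.0, 25.0, 15.0, 9.0, 18.0])

# Pearson correlation coefficient as returned by NumPy
# (off-diagonal entry of the 2x2 correlation matrix).
r = np.corrcoef(rainfall, average_delay)[0, 1]

# The same coefficient written out from its definition:
# covariance divided by the product of the standard deviations.
dx = rainfall - rainfall.mean()
dy = average_delay - average_delay.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# R-squared of a simple linear fit is the square of r.
r_squared = r ** 2

print(f"r = {r:.2f}, r (manual) = {r_manual:.2f}, r^2 = {r_squared:.2f}")
```

Reporting r keeps the sign of the relationship, whereas the square of r expresses the share of the variance in delay that a straight line through the rainfall values would explain.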

Finally, the delay per day is plotted together with the precipitation for Houston, as this airport showed the highest correlation. Furthermore, a calendar visualisation is given with the total delays for every day: the darker the red colour, the more delays there were on that day. On the days with more delays (dark red colour), there is also slightly more rainfall. From the previous graphs it was already concluded that there is some, but not much, correlation between rainfall and delay, which can also be seen in the bar graph below.

In [32]:
### Plot the average delay and precipitation at Houston Airport over the year in a bar chart
fig = plt.figure()

ax = fig.add_subplot(111)
ax2 = ax.twinx()

df3_HOU.Rainfall.plot(kind='bar', color='red', ax=ax, position=1)
df3_HOU.Average_Delay.plot(kind='bar', color='blue', ax=ax2, position=0)

ax.set_ylabel('Rainfall [inches]')
ax2.set_ylabel('Average Delay [minutes]')

plt.title('Delay and rainfall per day at Houston Airport in 2015')

### Set the labels on the x-axis
n=10
ticks = ax.xaxis.get_ticklocs()
ticklabels = ax.xaxis.get_ticklabels()

ax.xaxis.set_ticks(ticks[::n])
ax.xaxis.set_ticklabels(ticklabels[::n])

ax.legend(loc=0)
ax2.legend(loc=1)

plt.xlim(50,150)
plt.show()
In [33]:
calmap.calendarplot(df4_HOU, fig_kws={'figsize': (9.5,4)}, yearlabel_kws={'color':'black', 'fontsize':14}, subplot_kws={'title':'Delays at Houston Airport'});

Conclusion¶

At the start of the document, in addition to the main question, five sub-questions were identified. These have been addressed throughout the document using visualizations. The main question was stated as below:

What were the main air traffic delays for domestic flights in the United States over 2015?

To answer this research question, an extensive data file was used and explained through various visualizations. After importing all the data, the document was built up from the following components: introduction, reasons for delays, airline delays, airport delays in general and, finally, the relation to weather.

First of all, looking at the distribution over the week, it becomes clear that the flights are evenly distributed over the days of the week, apart from Saturday, which has slightly fewer flights. Regarding the duration of the flights, most flights last 50 to 100 minutes, with some outliers of more than 500 minutes.

The data file distinguishes 4 types of delay: weather delay, air system delay, security delay and late aircraft delay. Of these, air system delay and late aircraft delay occur more frequently than weather delay and security delay. Besides having departure and/or arrival delays, flights can also be cancelled completely. The cancellation percentage differs across the various airports, from 0.5 percent to 4 percent.

Related to airlines, Spirit Air Lines (NK) has on average significantly more delay than the other airlines for the year 2015. In addition, the visualization of the ranking of airline delays shows that NK stays in the top 4 in every month. The only airline that comes close to Spirit Air Lines is Frontier Airlines (F9). Noteworthy is Alaska Airlines (AS), which is the only airline with a negative average delay over the year.

Regarding airports, a ranking over the months of 2015 was given as well. However, it was difficult to draw a conclusion from this, due to large fluctuations. That is why the average departure delay per airport was investigated, which resulted in the following top 5:

  1. Trenton-Mercer Airport (TTN), New Jersey - 46.8 minutes per flight
  2. Wilmington International Airport (ILM), North Carolina - 36.9 minutes per flight
  3. Northeast Florida Regional Airport (UST), Florida - 33.0 minutes per flight
  4. Santa Maria Public Airport (SMX), California - 31.8 minutes per flight
  5. Unknown Airport 11471 - 31.3 minutes per flight

Interestingly, the biggest airports and the biggest airlines did not have the biggest delays. Therefore, the hypothesis was incorrect.

Finally, the correlation between weather data and the delays was examined. Using the visualizations, it can be concluded that although there is a relationship between weather and airport delays, the relationship is not strong. So, a rainy day in 2015 did not guarantee delays at the airports of the United States.

In conclusion, the main air traffic delays for domestic flights in the United States over 2015 were caused by air system delays and late aircraft. Moreover, the longest average delays occurred at Trenton-Mercer Airport (TTN) in New Jersey, and Spirit Air Lines (NK) had the most delay among the airlines.
